Annual Report of Research Progress on 
Supercomputers for Solving PDE Problems * 

Dr. Kai Hwang 
Computer Research Institute 
Our research achievements in this reporting period span several related research 
topics. We have investigated the mapping of PDE algorithms onto various multiprocessor 
architectures [1,2]. A language construct for developing parallel programs is proposed [3]. 
Other efforts include the domain decomposition approach to solving PDE problems [4] and 
efficient preprocessing and postprocessing in finite element analysis [5]. All the work was 
aimed at boosting the speed at which large-scale PDE problems can be solved on 
parallel computers. 

Mapping of parallel algorithms for solving PDE problems onto the Orthogonal 
Multiprocessor (OMP), a multiprocessor architecture conceived at USC, has been 
investigated in depth. Specifically, two methods, SLOR and ADI, are mapped onto the 
architecture. Analytical results show that speedup linear in the number of processors 
can be achieved by a proper distribution of data in the memory modules. 
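
The linear-speedup claim can be written out from the per-iteration costs quoted 
later in this report for a k x k grid on n processors. The symbols T_1, T_n, and S 
are introduced here for illustration only, and the step assumes the quoted bounds 
are tight and already include any communication overhead: 

```latex
T_1(k) = \Theta(k^2), \qquad
T_n(k) = \Theta\!\left(\frac{k^2}{n}\right)
\quad\Longrightarrow\quad
S(n) = \frac{T_1(k)}{T_n(k)} = \Theta(n).
```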

Mapping of multigrid algorithms is examined in the context of four classes of 
multicomputer architectures, namely, trees, hypercubes, meshes, and the OMP. Different 
mapping strategies are presented and analyzed in terms of the load balance achieved and 
the communication penalty paid in each case. Extensive comparisons have been conducted 
to provide useful guidelines for selecting suitable mapping strategies for different 
architectures. 

The Molecule language is proposed to bridge the gap between the development of hardware 
and software support for parallel computers. It provides syntactic and semantic rules that 
allow the user to specify the computation modes that best match the problem 
characteristics. Such a concurrent language approach is instrumental to the effective 
solution of PDE problems on supercomputers. 

Other related research results have also been reported on further development and 
potential optical implementation of pipeline nets [6] used in the Remps architecture [7]. 
Trends of parallel processing, including recent advances in optical and neural computing, 
and their prospective applications to PDE solutions are summarized in [8]. 
• An Orthogonal Multiprocessor (OMP) architecture is developed for solving PDE 
problems using the SLOR and ADI methods. 

• Linear speedup can be achieved with the OMP architecture, on which the SLOR and 
ADI methods are partitioned for parallel processing. 

• A V-Tree Multiprocessor is suggested for parallel implementation of the V-cycle in 
multigrid algorithms. 

The OMP Architecture and Orthogonal Memory Accesses: 

• The Bus Controller enables either row memory accesses (using 
the row buses) or column memory accesses (using the column 
buses) but not both at the same time. 

• These orthogonal memory access patterns avoid conflicts completely 
and, therefore, achieve full memory bandwidth. 

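The conflict-free property can be checked with a small model. This is only a 
sketch under an assumed layout that the text above does not spell out: n 
processors share an n x n array of memory modules, processor p reaches row p of 
the array in row mode and column p in column mode. The function names are 
hypothetical. 

```python
def accessible_modules(processor, mode, n):
    """Memory modules one processor can reach in the given access mode.

    Assumed layout (not stated in the report excerpt): module (i, j) sits on
    row bus i and column bus j of an n x n module array.
    """
    if mode == "row":
        return {(processor, j) for j in range(n)}
    if mode == "column":
        return {(i, processor) for i in range(n)}
    raise ValueError("mode must be 'row' or 'column'")


def is_conflict_free(mode, n):
    """True if no module is reachable by two processors in this mode."""
    seen = set()
    for p in range(n):
        modules = accessible_modules(p, mode, n)
        if seen & modules:
            return False
        seen |= modules
    return True


def mixed_mode_conflicts(n):
    """Modules contended if processor 0 used row access while all other
    processors used column access at the same time."""
    row_side = accessible_modules(0, "row", n)
    conflicts = set()
    for p in range(1, n):
        conflicts |= row_side & accessible_modules(p, "column", n)
    return conflicts
```

Under this model each mode on its own partitions the modules among the 
processors, while mixing the two modes would let two processors reach the same 
module, which is consistent with the bus controller enabling only one mode at a 
time. 
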
The SLOR Method on OMP: 

• The grid points are evenly distributed into 
two subsets by alternate lines (Fig. 2). 

• Each iteration requires $O(k^2/n)$ time on an 
OMP with n processors, where $k \times k$ is the grid 
size. Note that the same problem requires 
$O(k^2)$ time on a uniprocessor system. 

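A minimal sketch of one SLOR iteration with the alternate-line split, assuming 
the 5-point Laplace stencil with fixed boundary values stored in a ring around 
the k x k interior. The model problem, the function names, and the default 
relaxation factor are assumptions made for illustration, and the code is serial. 

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system; O(len(rhs)) operations."""
    m = len(rhs)
    c = [0.0] * m
    d = [0.0] * m
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def relax_line(u, row, omega):
    """Solve one grid line implicitly, then over-relax it in place."""
    k = len(u) - 2  # interior points per line; u carries a boundary ring
    rhs = [u[row - 1][j] + u[row + 1][j] for j in range(1, k + 1)]
    rhs[0] += u[row][0]        # left boundary value
    rhs[-1] += u[row][k + 1]   # right boundary value
    line = solve_tridiagonal([-1.0] * k, [4.0] * k, [-1.0] * k, rhs)
    for j in range(1, k + 1):
        u[row][j] += omega * (line[j - 1] - u[row][j])


def slor_iteration(u, omega=1.0):
    """One SLOR iteration: odd-numbered lines first, then even-numbered ones.

    Lines in the same subset depend only on lines of the other subset, so on
    a parallel machine they could be relaxed concurrently; this sketch simply
    loops over them.
    """
    k = len(u) - 2
    for parity in (1, 0):
        for row in range(1, k + 1):
            if row % 2 == parity:
                relax_line(u, row, omega)
    return u
```

Each line costs O(k) work and there are k lines, which matches the $O(k^2)$ 
uniprocessor figure; dealing the independent lines of each subset out to n 
processors is what would bring the count down to $O(k^2/n)$. 
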
Figure 2. The row distribution for either the SLOR method or the ADI method 

The ADI Method on OMP: 

• The grid points are distributed to the row 
memory and column memory (Fig. 2 and 
Fig. 3). 

• Each iteration of the ADI method on a grid 
of $k \times k$ points can be done in $O(k^2/n)$ time on 
an OMP with n processors. A linear speedup 
is achieved compared with a uniprocessor. 

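A minimal sketch of one ADI iteration, written as a Peaceman-Rachford step for 
the 5-point Laplace stencil with fixed boundary values. The model problem, the 
function names, and the acceleration parameter r are assumptions chosen for 
illustration; the report does not state which ADI variant was mapped. 

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm for a tridiagonal system; O(len(rhs)) operations."""
    m = len(rhs)
    c = [0.0] * m
    d = [0.0] * m
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, m):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * m
    x[-1] = d[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def adi_iteration(u, r=1.0):
    """One row sweep followed by one column sweep; returns the new grid.

    u is a (k+2) x (k+2) list of lists whose outer ring holds the boundary
    values. r > 0 is the (assumed) acceleration parameter.
    """
    k = len(u) - 2
    off = [-1.0] * k
    diag = [2.0 + r] * k

    # Row sweep: one independent tridiagonal solve per grid row.
    half = [row[:] for row in u]
    for i in range(1, k + 1):
        rhs = [u[i - 1][j] + (r - 2.0) * u[i][j] + u[i + 1][j]
               for j in range(1, k + 1)]
        rhs[0] += u[i][0]
        rhs[-1] += u[i][k + 1]
        half[i][1:k + 1] = solve_tridiagonal(off, diag, off, rhs)

    # Column sweep: one independent tridiagonal solve per grid column.
    new = [row[:] for row in half]
    for j in range(1, k + 1):
        rhs = [half[i][j - 1] + (r - 2.0) * half[i][j] + half[i][j + 1]
               for i in range(1, k + 1)]
        rhs[0] += half[0][j]
        rhs[-1] += half[k + 1][j]
        column = solve_tridiagonal(off, diag, off, rhs)
        for i in range(1, k + 1):
            new[i][j] = column[i - 1]
    return new
```

The row sweep touches the grid row by row and the column sweep touches it 
column by column, which is the access pattern the row and column distributions 
are meant to serve. The k solves within each sweep are independent of one 
another, so they could be shared among n processors. 
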
Figure 3. The column distribution for the ADI method 

Implementing Multigrid Algorithms on a V-tree Architecture: 

• Parallelization of the V-cycle in a multigrid 
algorithm 

• Efficient implementation of multigrid algorithms 
on a V-tree multiprocessor system 



Figure 6. A V-tree architecture constructed from two augmented trees 
joined at the roots 

Figure 4. A sequential multigrid algorithm has a V-cycle of 
successive projections from fine to coarse grids and 
a sequence of injections in the reverse direction 
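
The sequential V-cycle in the caption can be sketched for a 1-D model problem, 
-u'' = f with zero boundary values. The model problem, the weighted-Jacobi 
smoother, and the particular transfer operators (full weighting on the way 
down, linear interpolation on the way up) are assumptions for illustration; the 
function names `project` and `inject` merely follow the caption's wording. The 
sketch assumes the number of interior points is of the form 2^L - 1. 

```python
def smooth(u, f, h, sweeps=2, weight=2.0 / 3.0):
    """Weighted-Jacobi sweeps for -u'' = f with zero boundary values."""
    n = len(u)
    for _ in range(sweeps):
        old = u[:]
        for i in range(n):
            left = old[i - 1] if i > 0 else 0.0
            right = old[i + 1] if i < n - 1 else 0.0
            jacobi = 0.5 * (left + right + h * h * f[i])
            u[i] = (1.0 - weight) * old[i] + weight * jacobi
    return u


def residual(u, f, h):
    """Residual f - A u of the 3-point discretization."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2.0 * u[i] - left - right) / (h * h))
    return r


def project(fine):
    """Fine-to-coarse transfer (full weighting, an assumed choice)."""
    return [0.25 * fine[2 * i] + 0.5 * fine[2 * i + 1] + 0.25 * fine[2 * i + 2]
            for i in range((len(fine) - 1) // 2)]


def inject(coarse):
    """Coarse-to-fine transfer (linear interpolation, an assumed choice)."""
    m = len(coarse)
    fine = [0.0] * (2 * m + 1)
    for i in range(m):
        fine[2 * i + 1] = coarse[i]
    for i in range(0, 2 * m + 1, 2):
        left = coarse[i // 2 - 1] if i > 0 else 0.0
        right = coarse[i // 2] if i < 2 * m else 0.0
        fine[i] = 0.5 * (left + right)
    return fine


def v_cycle(u, f, h):
    """One V-cycle: descend to the coarsest grid, then climb back up."""
    if len(u) == 1:
        u[0] = 0.5 * h * h * f[0]  # exact solve of 2u/h^2 = f
        return u
    smooth(u, f, h)
    coarse_rhs = project(residual(u, f, h))
    correction = v_cycle([0.0] * len(coarse_rhs), coarse_rhs, 2.0 * h)
    fine_correction = inject(correction)
    for i in range(len(u)):
        u[i] += fine_correction[i]
    return smooth(u, f, h)
```

In this serial form only one grid level is active at any moment, which is the 
idle capacity a concurrent formulation would try to use. 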



Figure 5. Concurrent multigrid algorithm 

• One tree is devoted to the projection sequence 
of the V-cycle and the other to the 
injection sequence. 

• Both parallelism and vectorization are exploited 
on the V-tree. 

• Higher throughput and better processor utilization 
are achieved on the V-tree. 























